Controlling Variable Selection By the Addition of Pseudo Variables

Authors

  • Yujun Wu
  • Dennis D. Boos
  • Leonard A. Stefanski
  • Marc G. Genton
  • Hao Helen Zhang

Abstract

WU, YUJUN. Controlling variable selection by the addition of pseudo-variables. (Under the direction of Dr. Dennis D. Boos and Dr. Leonard A. Stefanski.)

Many variable selection procedures have been developed in the literature for linear regression models. We propose a new and general approach, the False Selection Rate (FSR) method, to control variable selection; it has the advantage of being applicable to a broader class of regression models, for example binary regression and Poisson regression. By adding a number of pseudo-variables to the real data set and monitoring the proportion of pseudo-variables falsely selected into the model, we are able to control the model false selection rate, selecting as many important variables as possible while selecting a relatively low proportion of unimportant ones. We focus on forward selection because it is applicable when there are more variables than observations. Because analytical results are difficult to obtain, we study our approach by Monte Carlo simulation and compare it with a variety of commonly used procedures. We first focus on linear regression models and then extend the approach to logistic regression models. The new method is illustrated on four real data sets.

Controlling Variable Selection By the Addition of Pseudo-Variables, by Yujun Wu. A dissertation submitted to the advisory committee on graduate studies of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy, Department of Statistics, Raleigh, NC, August 2004. Approved by: Dennis D. Boos (Co-Chair of Advisory Committee), Leonard A. Stefanski (Co-Chair of Advisory Committee), Marc G. Genton, and Hao Helen Zhang.

To Lin and my parents
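
The pseudo-variable idea described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's procedure: the abstract does not say how pseudo-variables are generated or how the false selection rate is estimated and tuned. The sketch assumes pseudo-variables are made by permuting the rows of the real predictor matrix, runs a plain greedy forward selection for a fixed number of steps, and reports the fraction of selected variables that are pseudo-variables. The function names `forward_select` and `pseudo_selection_fraction`, the simulated data, and the step count are all hypothetical.

```python
import numpy as np


def forward_select(X, y, n_steps):
    """Greedy forward selection: each step adds the column that lowers RSS most."""
    n, p = X.shape
    selected = []
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            # Intercept plus the already-selected columns plus candidate j.
            design = np.column_stack([np.ones(n), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = float(np.sum((y - design @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected


def pseudo_selection_fraction(X, y, n_steps, rng):
    """Append pseudo-variables and report what fraction of selections they take.

    Assumption: pseudo-variables are row-permuted copies of X, which keeps
    their joint distribution but breaks any relationship with y.
    Requires n_steps >= 1.
    """
    n, p = X.shape
    pseudo = X[rng.permutation(n), :]
    augmented = np.hstack([X, pseudo])  # columns p..2p-1 are pseudo-variables
    selected = forward_select(augmented, y, n_steps)
    n_pseudo = sum(j >= p for j in selected)
    return selected, n_pseudo / len(selected)


# Simulated data (hypothetical): only the first two predictors matter.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n)

selected, frac = pseudo_selection_fraction(X, y, n_steps=4, rng=rng)
print("selected columns:", selected)
print("fraction of pseudo-variables among selections:", frac)
```

Because pseudo-variables are unrelated to the response by construction, any that enter the model are known false selections, so the fraction gives a rough signal of how liberal the selection was at that model size. Averaging over several independent pseudo-variable draws would make the signal less noisy.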

Similar resources

Predictive Risk Mapping of Leptospirosis for North of Iran Using Pseudo-absences Data

Leptospirosis is a common zoonotic disease with a high prevalence in the world and is recognized as an important public health problem in both developing and developed countries owing to epidemics and increasing prevalence. Because of the high diversity of hosts that are capable of carrying the causative agent, this disease has an expansive geographical reach. Various environmental and social ...

Selection of Variables that Influence Drug Injection in Prison: Comparison of Methods with Multiple Imputed Data Sets

Background: Prisoners, compared to the general population, are at greater risk of infection. Drug injection is the main route of HIV transmission, in particular in Iran. What would be of interest is to determine variables that govern drug injection among prisoners. However, one of the issues that challenge model building is incomplete national data sets. In this paper, we addressed the process ...

An Overview of the New Feature Selection Methods in Finite Mixture of Regression Models

Variable (feature) selection has attracted much attention in contemporary statistical learning and recent scientific research. This is mainly due to the rapid advancement in modern technology that allows scientists to collect data of unprecedented size and complexity. One type of statistical problem in such applications is concerned with modeling an output variable as a function of a sma...

Supplier selection among alternative scenarios by Data envelopment analysis

A considerable problem in the competitive world of trade is choosing the best supply chain. As a result, under increasingly serious competition, finding the best supplier for manufacturing and for preparing raw material is very significant. Meanwhile, suppliers have different scenarios to fulfill, such as changing selection variables like lead-time, transportation cost and transportatio...

Application of genetic algorithm (GA) to select input variables in support vector machine (SVM) for analyzing the occurrence of roach, Rutilus rutilus, in streams

Support vector machine (SVM) was used to analyze the occurrence of roach in Flemish stream basins (Belgium). Several habitat and physico-chemical variables were used as inputs for the model development. The biotic variable merely consisted of abundance data which was used for predicting presence/absence of roach. Genetic algorithm (GA) was combined with SVM in order to select the most important...

Publication date: 2004